Data storage utilization is continually increasing, driving a proliferation of storage systems in data centers. Monitoring and managing these systems require increasing amounts of human resources. Information technology (IT) organizations often operate reactively, taking action only when systems reach capacity or fail, at which point performance degradation or failure has already occurred. Hard disk failures fall into one of two basic classes: predictable failures and unpredictable failures. Predictable failures result from slow processes such as mechanical wear and gradual degradation of storage surfaces. Monitoring can determine when such failures are becoming more likely. Unpredictable failures happen suddenly and without warning. They range from defective electronic components to sudden mechanical failure (perhaps due to improper handling).
Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T., also written SMART) is a monitoring system for computer hard disk drives that detects and reports on various indicators of reliability, in the hope of anticipating failures. When S.M.A.R.T. anticipates a failure, the user may choose to replace the drive to avoid an unexpected outage and data loss. The manufacturer may be able to use the S.M.A.R.T. data to discover where faults lie and prevent them from recurring in future drive designs. However, not all S.M.A.R.T. attributes consistently provide reliable indications of possible disk failures. The attributes tend to vary, and they may be interpreted differently from one hard disk vendor or configuration to another. There has been a lack of reliable mechanisms for determining which S.M.A.R.T. attributes are the best indicators of disk failure, and a lack of efficient ways to predict single-disk or multi-disk failures in a redundant array of independent disks (RAID) environment.
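By way of illustration only, the conventional per-attribute check that S.M.A.R.T. tooling performs may be sketched as follows. The sketch assumes the common convention in which each attribute carries a normalized value and a vendor-defined threshold, and an attribute is treated as failing when its normalized value is at or below a non-zero threshold. The names SmartAttribute and failing_attributes, and all sample values, are hypothetical and are not taken from any particular drive or tool.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int     # attribute identifier reported by the drive
    name: str        # human-readable label (vendor-dependent)
    value: int       # normalized current value (assumed: higher is healthier)
    threshold: int   # vendor-defined failure threshold (assumed: 0 = informational only)


def failing_attributes(attrs):
    """Return the attributes whose normalized value is at or below a non-zero threshold."""
    return [a for a in attrs if a.threshold > 0 and a.value <= a.threshold]


# Hypothetical readings, invented purely for illustration.
sample = [
    SmartAttribute(5, "Reallocated_Sector_Ct", 30, 36),
    SmartAttribute(9, "Power_On_Hours", 88, 0),
    SmartAttribute(194, "Temperature_Celsius", 110, 0),
]

for attr in failing_attributes(sample):
    print(f"attribute {attr.attr_id} ({attr.name}) is at or below its threshold")
```

A check of this kind treats every attribute alike and relies entirely on vendor-chosen thresholds, which is consistent with the limitation noted above: it does not indicate which attributes are the most reliable predictors, and it does not address multi-disk failures in a RAID environment.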